Food inspection is critical to combating foodborne illness: it identifies potential outbreak sources so that they can be addressed before people get sick. According to the Centers for Disease Control and Prevention (CDC), roughly 1 in 6 Americans gets sick, 128,000 are hospitalized, and 3,000 die of foodborne diseases each year. Chicago, which has conducted food inspections since 1876, is among the first cities to combine data analysis with traditional inspection methods.
This analysis applies logistic regression to predict whether a food establishment will fail its next food inspection. It is intended to make the inspection process more efficient, both by identifying the establishments most likely to fail and by providing additional information to speed up site selection and routing. By using predictions to help prioritize resource allocation, the proposed application is designed as a supplemental tool in the battle against foodborne illness.
Initially, food inspections were conducted mostly by checking food establishments at random, an unsatisfying approach given the limited time and number of inspectors. The approach has evolved over the years: inspectors are now assigned beats, or groups of restaurants, to inspect a few times a year, with frequency depending on a restaurant's assessed risk level, the complexity of its menu, and how likely its ingredients are to trigger food poisoning (The Washington Post, 2014). Despite this improved framework, the selection process remains largely empirical and may be biased as a result.
Based on previous work done by the Department of Public Health in Chicago, we propose a method that integrates 311 complaints and socioeconomic data with previous inspection results to forecast which food establishments are most likely to fail, so that their inspections can be prioritized. The expected users of this application are food inspectors.
The following features can be accessed in this app:
The methodology will be discussed in greater detail in section 2.
With adequate data as is listed in section 2.1, this analysis can be applied to different cities that hope to use data analytics to optimize food inspection.
The data needed for conducting this analysis are:
Because the regression method for this project is logistic regression, the following primary analyses are done in the exploratory stage for the different groups of variables.
Tabulation of the dependent variable
Distribution map of the dependent variable
Cross tabulation of categorical variables
Mean and standard deviation of continuous variables
Pearson Correlation among the variables and with the dependent variable
| Value of Dependent Variable | Number of Events | Proportion of Events |
|---|---|---|
| Failed: fail_flag=1 | 11359 | 0.218 |
| Passed: fail_flag=0 | 40862 | 0.782 |
As the numbers of categories differ across predictors, the complete cross tabulation results are not shown in this markdown. Shown here are the Chi-square test results for all categorical predictors against the dependent variable. The Chi-square test determines whether the distribution of one categorical variable varies with respect to the values of another.
| Predictor | p-value for Chi-square test | Significance |
|---|---|---|
| Count of past serious violations | 1.43e-104 | *** |
| Count of past critical violations | 1.14e-51 | *** |
| Whether holding consumption activity license | 4.48e-06 | *** |
| Whether holding a tobacco license | 2.46e-25 | *** |
| Facility type | 4.46e-13 | *** |
| Season of the inspection | 1.32e-09 | *** |
The results indicate significant differences in failure risk across the values of all the categorical predictors.
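For reference, the test summarized above can be reproduced in base R. The sketch below uses small hypothetical vectors, not the project data, to show the shape of a chi-square test of independence:

```r
## Chi-square test of independence between a binary outcome and a
## categorical predictor (hypothetical data, for illustration only).
fail_flag <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0)
season    <- c(1, 1, 2, 2, 3, 3, 4, 4, 1, 2, 3, 4)
tab <- table(fail_flag, season)            # contingency table
res <- suppressWarnings(chisq.test(tab))   # warning suppressed: tiny cell counts
res$p.value                                # small p => distributions differ
```

In the analysis itself, the same test is run through `gmodels::CrossTable()` with `chisq = TRUE`, which wraps `chisq.test()` internally.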
Independent-sample t-tests are used to examine whether the mean values of the continuous predictors differ significantly between failed and passed inspections.
| Predictor | Value of Dependent Variable | Mean Value | Standard Deviation | p-value of t test | Significance |
|---|---|---|---|---|---|
| Sanitation complaint density | 0 | 23.534 | 23.633 | 1.61e-19 | *** |
| Sanitation complaint density | 1 | 26.033 | 26.662 | 1.61e-19 | *** |
| Garbage cart removal density | 0 | 14.333 | 14.014 | 7.05e-45 | *** |
| Garbage cart removal density | 1 | 16.485 | 14.496 | 7.05e-45 | *** |
| Crime density | 0 | 21.574 | 16.978 | 8.20e-28 | *** |
| Crime density | 1 | 23.626 | 17.848 | 8.20e-28 | *** |
| Elapsed time since last visit | 0 | 1.233 | 0.583 | 1.21e-04 | *** |
| Elapsed time since last visit | 1 | 1.257 | 0.586 | 1.21e-04 | *** |
| Age of business license | 0 | 7.252 | 4.660 | 0.069 | |
| Age of business license | 1 | 7.343 | 4.738 | 0.069 | |
The results indicate significant differences between failed and passed inspections for all continuous predictors except the age of the business license.
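A minimal sketch of the test behind this comparison, using simulated values rather than the project data:

```r
## Independent-sample t-test comparing a continuous predictor between
## passed (0) and failed (1) inspections; data simulated for illustration.
set.seed(1)
fail_flag <- rep(c(0, 1), each = 30)
density   <- c(rnorm(30, mean = 23, sd = 5),   # "passed" group
              rnorm(30, mean = 26, sd = 5))    # "failed" group
res <- t.test(density ~ fail_flag)             # Welch two-sample t-test
res$p.value                                    # small p => group means differ
```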
Although the Pearson correlation is not the most appropriate measure for analyzing relationships among categorical predictors, it can be used to flag potential multicollinearity.
As can be seen from the correlation plot, there is no severe multicollinearity.
Logistic regression is a predictive analysis approach commonly used to describe data and to explain the relationship between a binary dependent variable and a set of predictors. It models the log-odds (logit) of the outcome as a linear function of the predictors; exponentiating a coefficient yields an odds ratio, and the logit can be transformed back into a probability.
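As a sketch of that transformation, two rounded coefficients from the model reported later in this document (intercept -3.32, past serious violations 4.30) can be turned into odds and a probability by hand; all other predictors are held at 0 for simplicity:

```r
## Converting a logistic-regression linear predictor (log-odds) into a
## probability and an odds ratio. Coefficients rounded from the fitted
## model; other predictors held at 0 for simplicity.
beta0 <- -3.32
beta1 <-  4.30
logit <- beta0 + beta1 * 1              # log-odds when pastSerious = 1
p     <- exp(logit) / (1 + exp(logit))  # back-transform to a probability
or    <- exp(beta1)                     # odds ratio for pastSerious
round(p, 3)                             # ~0.727
```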
The misclassification rate is the ratio of incorrectly predicted events to the total number of events.
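In R this reduces to one line; the sketch below uses hypothetical predictions:

```r
## Misclassification rate: share of predictions that disagree with the
## observed outcomes (hypothetical vectors, for illustration).
observed  <- c(1, 0, 0, 1, 0, 1, 0, 0)
predicted <- c(1, 0, 1, 1, 0, 0, 0, 0)
mean(predicted != observed)   # 2 of 8 wrong -> 0.25
```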
ROC curves plot the true positive rate (sensitivity) against the false positive rate (1-specificity). The area under the ROC curve (AUC) is a measure of the predictive accuracy of the model: it evaluates how well the model predicts 1 responses as 1s and 0 responses as 0s. A higher AUC means a cut-off value can be found at which both the sensitivity and the specificity of the model are relatively high. Possible AUC values range between 0.5 and 1. The value ranges for excellent, good, fair, poor, and failing models used in this report are 0.90-1, 0.80-0.90, 0.70-0.80, 0.60-0.70, and 0.50-0.60 respectively.
ROC curves can also be used to determine the best cut-off value and to examine the predictive quality of the model. The optimal cut-off corresponds to the point on the ROC curve with the minimum distance from the upper left corner of the plot.
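The criterion can be sketched directly: for each candidate cut-off, compute the Euclidean distance from its ROC point (false positive rate, true positive rate) to the corner (0, 1) and keep the minimizer. The ROC points below are made up for illustration; in the analysis itself this is done by the `opt.cut()` helper:

```r
## Distance-to-corner rule for choosing an ROC cut-off
## (hypothetical ROC points, for illustration only).
fpr     <- c(0.00, 0.10, 0.25, 0.60, 1.00)
tpr     <- c(0.00, 0.55, 0.85, 0.95, 1.00)
cutoffs <- c(0.90, 0.70, 0.45, 0.20, 0.00)
d <- sqrt(fpr^2 + (tpr - 1)^2)   # distance to the upper left corner (0, 1)
cutoffs[which.min(d)]            # -> 0.45 for these made-up points
```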
All of the assumptions are met in this analysis.
The model and its result can be seen as follows.
Call:
glm(formula = fail_flag ~ pastSerious + pastCritical + ageAtInspection +
consumption_on_premises_incidental_activity + tobacco + heat_burglary +
heat_sanitation + heat_garbage + Facility_Type + timeSinceLast +
season, family = "binomial", data = test)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8685 -0.2443 -0.2107 -0.1673 2.9198
Coefficients:
Estimate Std. Error z value
(Intercept) -3.3245965 0.1655666 -20.080
pastSerious 4.3003315 0.0813422 52.867
pastCritical 0.1942207 0.0629243 3.087
ageAtInspection 0.0519457 0.0653758 0.795
consumption_on_premises_incidental_activity -0.1495578 0.0728930 -2.052
tobacco 0.0088534 0.1543000 0.057
heat_burglary -0.0059969 0.0019063 -3.146
heat_sanitation 0.0009851 0.0018269 0.539
heat_garbage 0.0155058 0.0024651 6.290
Facility_TypeRestaurant -0.2771983 0.1000000 -2.772
timeSinceLast -0.2252981 0.0612152 -3.680
season -0.0227688 0.0275970 -0.825
Pr(>|z|)
(Intercept) < 2e-16 ***
pastSerious < 2e-16 ***
pastCritical 0.002025 **
ageAtInspection 0.426863
consumption_on_premises_incidental_activity 0.040195 *
tobacco 0.954244
heat_burglary 0.001656 **
heat_sanitation 0.589734
heat_garbage 3.17e-10 ***
Facility_TypeRestaurant 0.005572 **
timeSinceLast 0.000233 ***
season 0.409345
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 13221.2 on 12080 degrees of freedom
Residual deviance: 7074.4 on 12069 degrees of freedom
AIC: 7098.4
Number of Fisher Scoring iterations: 6
As can be seen from the results, past violation records, crime density, garbage cart removal density, facility type, time since last inspection, and whether the establishment holds a consumption activity license are the most significant predictors in this model.
As for the relationships between the predictors and the dependent variable, it is a little surprising that the coefficient for crime density is negative, which indicates that an increase in crime density lowers the predicted failure risk. It is also somewhat surprising that sanitation complaint density and season, a proxy for the time of inspection, are not significant predictors.
The optimal cut-off value corresponds to the point on the ROC curve closest to the upper left corner of the plot, as shown below.
And the optimal cut-off value is: 0.425
For model evaluation, the following analyses are done:
The plot shows that past serious violations have the greatest absolute estimated coefficient, much larger than the other predictors, which indicates a potential for overfitting. It also makes sense, however, that past records are very important factors for predicting inspection outcomes.
The AUC value is: 0.905
According to the standard for evaluation, the performance of this model is excellent.
The misclassification rate is: 13.575 %
The mean RMSE of the cross validation result is: 0.258
The standard deviation of RMSE of the cross validation result is: 0.015
The mean MAE of the cross validation result is: 0.133
The standard deviation of MAE of the cross validation result is: 0.011
The generalizability of this model is pretty good, according to the cross validation results.
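For clarity, the two cross-validation metrics reported above are computed as follows, using hypothetical predicted probabilities and observed outcomes:

```r
## RMSE and MAE between predicted probabilities and observed outcomes
## (hypothetical values, for illustration of the formulas only).
obs  <- c(1, 0, 0, 1, 0)
pred <- c(0.8, 0.1, 0.4, 0.6, 0.2)
rmse <- sqrt(mean((pred - obs)^2))   # root mean squared error
mae  <- mean(abs(pred - obs))        # mean absolute error
round(c(RMSE = rmse, MAE = mae), 3)
```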
Note: the alert system mentioned before is difficult to realize at the moment.
This function can be realized using network analysis.
This function can be realized using cluster analysis.
Judging from the analysis results, this app fulfills its purpose well as an assistive tool for prioritizing food inspection sites, since both its accuracy and generalizability are satisfactory. The designed features aim to be both easy to use and powerful, but their functionality can best be evaluated with feedback or comments from the target users; from the designers' standpoint, they serve their intent.
For the prediction model, it may be an issue that past records play a far more important role in the model than the sanitation complaints from 311 service requests, and possibly other real-time proxies for food inspection failure, which can bias the model.
Meanwhile, because we were unable to access weather data for Chicago, no weather predictor was included. The same applies to data on each food establishment's ingredients and food sources, which may not be available at all.
The co-occurrence of food inspection failures across different time periods might also offer more insight into the issue, but we did not examine it.
In response to the limitations discussed in section 5.2, it might be worthwhile to incorporate more real-time factors, for example whether there is an ongoing food poisoning event in the city. More attention could also be paid to the food market and other relevant fields, since the predictors in the current model may not cover a broad enough range. Finally, more validation and analysis of spatial and temporal patterns could be added.
#### There are some small differences in the visualization code blocks, which do not affect the results.
####==============================================
## Exploratory Analysis
####==============================================
setwd("C:/Users/ywx12/Desktop/MUS507/finalproject/data/Rds files")
mapTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "Black", fill=NA, size=2)
)
}
####--------------------------------------------
## Load Data
####--------------------------------------------
library(data.table)  # dat is indexed with data.table syntax below
library(lubridate)
dat <- readRDS("dat_model2.Rds")
## Only keep "Retail Food Establishment"
dat <- dat[LICENSE_DESCRIPTION == "Retail Food Establishment"]
## Remove License Description
dat[ , LICENSE_DESCRIPTION := NULL]
dat <- na.omit(dat)
## Add criticalFound variable to dat:
dat[ , criticalFound := pmin(1, criticalCount)]
## Set the key for dat
setkey(dat, Inspection_ID)
dat$month<-as.numeric(month(dat$Inspection_Date))
dat$season=ifelse(dat$month>=9 & dat$month<=11,3,
ifelse(dat$month>=3 & dat$month<=5,1,
ifelse(dat$month>=6 & dat$month<=8,2,
4)))
####-------------------------------------------------------------
## Create Model Data
####-------------------------------------------------------------
xmat <- dat[ , list(
pastSerious = pmin(seriousCount, 1),
pastCritical = pmin(criticalCount, 1),
timeSinceLast,
ageAtInspection = ifelse(ageAtInspection > 4, 1L, 0L),
consumption_on_premises_incidental_activity,
tobacco,
heat_burglary = pmin(heat_burglary, 70),
heat_sanitation = pmin(heat_sanitation, 70),
heat_garbage = pmin(heat_garbage, 50),
season,
Facility_Type,
criticalFound,
Results,
fail_flag,
#Inspection_ID,
License,
Latitude,
Longitude),
keyby = Inspection_ID]
names(xmat)
mm <- model.matrix(criticalFound ~ . -1, data=xmat[ , -1, with=F])
mm <- as.data.table(mm)
####-----------------------------------------------------------
## Tabulation of the dependent variable
####-----------------------------------------------------------
FAIL.tab<-table(dat$fail_flag)
FAIL.tab # absolute counts
prop.table(FAIL.tab)#proportion
####------------------------------------------------------------
## Distribution map of the dependent variable
####------------------------------------------------------------
library(leaflet)
library(ggmap)
library(dplyr)
pal <- colorNumeric(
palette = "Spectral",
domain = dat$fail_flag)
fitmap <- leaflet() %>% setView(lng = -87.674904, lat = 41.846241, zoom = 11)
fitmap %>%
addProviderTiles(providers$Stamen.Toner) %>%
addCircles(data = dat, weight = 1.5, color = ~pal(fail_flag))
####------------------------------------------------------------
## Cross tabulation of binary & categorical variables
####------------------------------------------------------------
library(aod)
library(ggplot2)
library(rms)
library(gmodels)
library(sf)
cross_pasterious<-CrossTable(dat$fail_flag,dat$pastSerious, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)
cross_pastcritical<-CrossTable(dat$fail_flag,dat$pastCritical, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)
cross_consumpli<-CrossTable(dat$fail_flag,dat$consumption_on_premises_incidental_activity, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)
cross_tobali<-CrossTable(dat$fail_flag,dat$tobacco, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)
cross_facility<-CrossTable(dat$fail_flag,dat$Facility_Type, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)
cross_season<-CrossTable(dat$fail_flag,dat$season, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)
crosstabs<-
rbind(cross_pasterious$chisq$p.value,cross_pastcritical$chisq$p.value,cross_consumpli$chisq$p.value,
cross_tobali$chisq$p.value,cross_facility$chisq$p.value,cross_season$chisq$p.value)%>%
as.data.frame()
cat_variablenames <- list("Count of past serious violations","Count of past critical violations",
"Whether holding consumption activity license","Whether holding a tobacco license",
"Facility type","Season of the inspection")
crosstabs$variablenames<-cat_variablenames
crosstabs<-
cbind(crosstabs$variablenames,crosstabs$V1)%>%
as.data.frame()
colnames(crosstabs)<-list("Predictor","p-value for Chi squre test")
crosstabs$Significance<-{"***"}
crosstabs
####---------------------------------------------------------------
## Mean & standard deviation of continuous variables
####---------------------------------------------------------------
con_variablenames<-list("Sanitation complaint density","",
"Garbage cart removal density","",
"Crime density","",
"Elapse time since last visit","",
"Age of business license","")
tab_sanitaiton<-
cbind(tapply(dat$heat_sanitation,dat$fail_flag,mean),
tapply(dat$heat_sanitation,dat$fail_flag,sd),
(t.test(dat$heat_sanitation~dat$fail_flag))$p.value)%>%
as.data.frame()
tab_garbage<-
cbind(tapply(dat$heat_garbage,dat$fail_flag,mean),
tapply(dat$heat_garbage,dat$fail_flag,sd),
(t.test(dat$heat_garbage~dat$fail_flag))$p.value)%>%
as.data.frame()
tab_crime<-
cbind(tapply(dat$heat_burglary,dat$fail_flag,mean),
tapply(dat$heat_burglary,dat$fail_flag,sd),
(t.test(dat$heat_burglary~dat$fail_flag))$p.value)%>%
as.data.frame()
tab_time<-
cbind(tapply(dat$timeSinceLast,dat$fail_flag,mean),
tapply(dat$timeSinceLast,dat$fail_flag,sd),
(t.test(dat$timeSinceLast~dat$fail_flag))$p.value)%>%
as.data.frame()
tab_ageofbiz<-
cbind(tapply(dat$ageAtInspection,dat$fail_flag,mean),
tapply(dat$ageAtInspection,dat$fail_flag,sd),
(t.test(dat$ageAtInspection~dat$fail_flag))$p.value)%>%
as.data.frame()
tab_preds<-
rbind(tab_sanitaiton,tab_garbage,tab_crime,tab_time,tab_ageofbiz)%>%
as.data.frame()
tab_preds$predname<-con_variablenames
tab_preds$dep_val<-list(0,1,0,1,0,1,0,1,0,1)
tab_preds<-
cbind(tab_preds$predname,tab_preds$dep_val,tab_preds$V1,tab_preds$V2,tab_preds$V3)%>%
as.data.frame()
colnames(tab_preds)<-list("Predictor","Value of Dependent Variable","Mean Value","Standard Deviation","p-value of t test")
tab_preds$Significance<-
ifelse(tab_preds$`p-value of t test`<=0.05,"***","")
tab_preds
####-----------------------------------------------------------------
## Pearson correlation among the variables: check multicollinearity
####-----------------------------------------------------------------
library(corrplot)
cordat<-dplyr::select(xmat,-Inspection_ID,-Facility_Type,-criticalFound,-Results,-License,-Latitude,-Longitude)
colnames(cordat)<-c("PastSerious","PastCritical","TimeSinceLast","LicenseAge","ConsumptLicense",
"TobaccoLicense","CrimeDensity","SanitationComplaints","GarbageRemoval","Season",
"Fail_flag")
corcordat<-cor(cordat)
corrplot(corcordat,method="color", type="upper", tl.pos = "td",
#method="circle",
order="hclust",
tl.cex = 0.6,tl.col = "black")
####===========================================================
## Model
####===========================================================
iiTrain <- dat[ , which(Inspection_Date < "2017-01-01")]
iiTest <- dat[ , which(Inspection_Date > "2017-01-01")]
training<-xmat[iiTrain,]
test<-xmat[-iiTrain,]
reglog <- glm(fail_flag ~ pastSerious + pastCritical + ageAtInspection +
                consumption_on_premises_incidental_activity + tobacco +
                heat_burglary + heat_sanitation + heat_garbage +
                Facility_Type + timeSinceLast + season,
              data = test, family = "binomial")
summary(reglog)
#predtr<-predict(reglog,training)
#predtr<-predict(reglog,test)
predtr <- fitted(reglog)  # fitted() takes no newdata argument; it returns the fitted probabilities for the data the model was fit on
predtr <-
  data.frame(Results = test$Results,
             observedfail = test$fail_flag,
             predictedfail = predtr,  # no exp() needed: fitted() already returns probabilities
             Latitude = test$Latitude,
             Longitude = test$Longitude)
####-----------------------------------------------------------
## Find the cut-off value
####-----------------------------------------------------------
library(ROCR)
library(nnet)
#Goodness of Fit
fit<-reglog$fitted
hist(fit)
#ROC Curve
a <- cbind(predtr, fit)
#colnames(a) <- c("Results","fail_flag","predr","predictions")
head(a)
roc <- as.data.frame(a)
pred <- prediction(roc$fit, roc$observedfail)
#Below, tpr = true positive rate, another term for sensitivity
#fpr = false positive rate, or 1-specificity
roc.perf = performance(pred, measure = "tpr", x.measure="fpr")
plot(roc.perf)
abline(a=0,b=1)
opt.cut = function(perf, pred){
cut.ind = mapply(FUN=function(x, y, p){
d = (x - 0)^2 + (y-1)^2
ind = which(d == min(d))
c(sensitivity = y[[ind]], specificity = 1-x[[ind]],
cutoff = p[[ind]])
}, perf@x.values, perf@y.values, pred@cutoffs)
}
print(opt.cut(roc.perf, pred))
auc.perf = performance(pred, measure ="auc")
auc.perf@y.values
head(a)
####----------------------------------------------------------
## Map of result
####----------------------------------------------------------
library(magrittr)
library(leaflet)
pal2 <- colorNumeric(
palette = "Spectral",
domain = a$fit)
fitmap2 <- leaflet() %>% setView(lng = -87.674904, lat = 41.846241, zoom = 11)
fitmap2 %>%
addProviderTiles(providers$Stamen.Toner) %>%
addCircles(data = a, weight = 1.5, color = ~pal2(fit))
### Risk score map
library(ggmap)
library(dplyr)
ChicagoBase<-get_stamenmap(bbox = c(left = -87.852725, bottom = 41.645267, right = -87.556860,
top = 42.032880), zoom = 13, maptype = c("toner"), crop = FALSE, messaging = FALSE,urlonly = FALSE, force = TRUE)
mapfit<-ggmap(ChicagoBase)+
geom_point(data = a,
aes(x=Longitude, y=Latitude, color=factor(ntile(fit,5))),
size = 1) +
labs(title="Chicago Food Inspection Risk Score") +
scale_colour_manual(values = c("#1a9641","#a6d96a","#ffffbf","#fdae61","#d7191c"),
labels=as.character(quantile(a$fit,
c(.1,.2,.4,.6,.8),na.rm=T)),
name="Risk Score\n (Quintile Breaks)") +
mapTheme()
mapfit<-mapfit +
guides(color=guide_legend(override.aes = list(size=4)))
mapfit
####=============================================================
## Model evaluation
####=============================================================
####-------------------------------------------------------------
## Absolute values of the coefficients
####-------------------------------------------------------------
standardized <- as.data.frame(coef(reglog))
standardized$variable <- row.names(standardized)
colnames(standardized)[1] <- "std_coefficient"
standardized
#ggplot(standardized, aes(x=variable, y=abs(std_coefficient), fill=variable)) +
# geom_bar(stat="identity")
standardized$absCoef <- abs(standardized$std_coefficient)
ggplot(standardized, aes(x=reorder(variable,-absCoef), y=absCoef, fill=variable)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
####-----------------------------------------------------------
## Accuracy: Misclassification rate & AUC
####-----------------------------------------------------------
## Error map
b<-a %>%
mutate(ResultIndex = Results) %>%
mutate(fitIndex = fit)
levels(b$ResultIndex)
b$ResultIndex <- as.character(b$ResultIndex)
b$ResultIndex[b$ResultIndex == "Pass"] <- "0"
b$ResultIndex[b$ResultIndex == "Pass w/ Conditions"] <- "0"
b$ResultIndex[b$ResultIndex == "Not Ready"] <- ""
b$ResultIndex[b$ResultIndex == "Fail"] <- "1"
b$ResultIndex <- as.numeric(b$ResultIndex)
b$fitIndex <- as.numeric(b$fitIndex)
b$fitIndex <- ifelse(b$fitIndex >= 0.4184682, 1, 0)  # compare numerically; 0.4184682 is the cut-off from opt.cut()
b<-b%>%
mutate(absError = abs(fitIndex-ResultIndex))%>%
mutate(absError = as.factor(absError))
ChicagoBase<-get_stamenmap(bbox = c(left = -87.940572, bottom = 41.650391, right = -87.556860,
top = 42.032880), zoom = 13, maptype = c("toner"), crop = FALSE, messaging = FALSE,urlonly = FALSE, force = TRUE)
mapError<-ggmap(ChicagoBase)+
geom_point(data = b,
aes(x= Longitude, y= Latitude, color=absError),
size = 1) +
labs(title="Chicago Food Inspection Error Map") +
scale_colour_manual(values = c("0" = "#0571b0","1" = "#ca0020","NA" = ""),
labels=c("Correct","Wrong","NA"),
name="Error") +
mapTheme()
mapError<-mapError +
guides(color=guide_legend(override.aes = list(size=4)))
mapError
## AUC
auc.perf@y.values
## Misclassification rate
optval<-opt.cut(roc.perf,pred)[3,1]
optsens<-opt.cut(roc.perf,pred)[1,1]
optspec<-opt.cut(roc.perf,pred)[2,1]
fit.binary=(fit>=optval)
cross_acc<-CrossTable(fit.binary, test$fail_flag,prop.r = FALSE,prop.t = FALSE,prop.chisq = FALSE)
#misclassification = (196+1444)*100/12081
####----------------------------------------------------------
## Generalizability: cross validation
####----------------------------------------------------------
library(caret)
scaleFunc<-function(x)sprintf("%.3f",x)
fitControl<-
trainControl(method = 'cv', number = 100)
set.seed(825)
lmFit <- train(fail_flag ~ pastSerious + pastCritical + ageAtInspection +
                 consumption_on_premises_incidental_activity + tobacco +
                 heat_burglary + heat_sanitation + heat_garbage +
                 Facility_Type + timeSinceLast + season,
               data = training, family = "binomial",
               method = 'glm',
               trControl = fitControl)
#==================================RMSE==================================
meanRMSE=toString(scaleFunc(mean(lmFit$resample[,1])))
sdRMSE=toString(scaleFunc(sd(lmFit$resample[,1])))
strmean_1='The mean RMSE of the cross validation result is:'
strsd_1='The standard deviation of RMSE of the cross validation result is:'
strmean<-paste(strmean_1,meanRMSE,collapse = '')
strsd<-paste(strsd_1,sdRMSE,collapse = '')
cat(strmean,"\n")
cat(strsd,"\n")
ggplot(as.data.frame(lmFit$resample), aes(RMSE)) +
geom_histogram(bins=20) +
labs(x="RMSE", y="Count",
title='Histogram of RMSE of 100-fold cross-validation')
#=================================MAE=======================================
meanMAE=toString(scaleFunc(mean(lmFit$resample[,3])))
sdMAE=toString(scaleFunc(sd(lmFit$resample[,3])))
strmean_2='The mean MAE of the cross validation result is:'
strsd_2='The standard deviation of MAE of the cross validation result is:'
strmean1<-paste(strmean_2,meanMAE,collapse = '')
strsd1<-paste(strsd_2,sdMAE,collapse = '')
cat(strmean1,"\n")
cat(strsd1,"\n")
ggplot(as.data.frame(lmFit$resample), aes(MAE)) +
geom_histogram(bins=20) +
labs(x="MAE", y="Count",
title='Histogram of MAE of 100-fold cross-validation')
Thanks for reading!
Any suggestions or comments would be of great help to our future improvement.